Towards interoperability of discourse annotation
Use cases of corpora

Dan Cristea
dcristea@info.uaic.ro
Computer Science Faculty, “Alexandru Ioan Cuza” University of Iași, and Institute of Computer Science, the Iași Branch of the Romanian Academy

Text versus discourse
Syntactically, a discourse is more than a single sentence
Gabriel Garcia Marquez: “The Autumn of the Patriarch”
EUROLAN-2015, Sibiu, 23-24 July

Text versus discourse
A text is not a discourse! But it becomes a discourse the very moment it is read or heard by a human or a machine

Discourse level interconnectivity?
• Suppose we are interested in searching corpora for patterns of rhetorical argumentation
  – find similar argumentation lines in multilingual corpora
• Or similar sequences of events in books
  – to what degree are two plots alike (conceptual plagiarism)?

Why is it difficult to achieve the discourse level of interconnectivity?
• There is no established representation of discourse level phenomena
• Discourse representation is difficult to achieve even for humans
• Large diversity of opinions with respect to discourse level structures
• Errors accumulate in processing pipelines: discourse structures are the result of a long chain of modules that have to work together

Issues
Part I: A quick overview of some discourse matters
Part II: The MASC experiment
Part III: The QuoVadis experiment
Part IV: The MappingBooks experiment

Part I
A quick overview of some discourse theories
• Some discourse models we need to know about
  – RST
  – Centering
  – Veins Theory

Rhetorical Structure Theory (RST)
(Mann & Thompson 1987, Moser & Moore 1996, Carlson et al. 2001, Asher & Lascarides 2003)
• A coherent discourse is described by a tree
  – elementary discourse units
    • usually clauses or single-clause sentences
  – discourse relations link adjacent spans of text
    • paratactic (multinuclear)
    •
hypotactic (nucleus-satellite)

Structures are trees
1. Farmington Police had to help control traffic recently
2. when hundreds of people lined up to be among the first applying for jobs at the yet-to-open Marriott Hotel.
3. The hotel's help-wanted announcement – for 300 openings – was a rare opportunity for many unemployed.
4. The people waiting in line carried a message of claims that the jobless could be employed if only they showed enough moxie.
5. Every rule has exceptions,
6. but the tragic and too-common tableaux of hundreds of people snake-lining up for any task with a paycheck illustrates a lack of jobs,
7. not laziness.
[Figure: RST tree over units 1-7. Spans: 1-7, 1-3, 4-7, 2-3, 5-7, 6-7. Relations: background, volitional result, evidence, circumstance, concession, antithesis]

Centering - a theory of local discourse coherence
• Joshi, A.K. and Weinstein, S., 1981: “Control of Inference: Role of Some Aspects of Discourse-Structure Centering”
• Grosz, B.; Joshi, A.K. and Weinstein, S., 1986: “Towards a computational theory of discourse interpretation”
• Brennan, S.E.; Friedman, M.W. and Pollard, C.J., 1987: “A Centering approach to pronouns”
• Grosz, B.; Joshi, A.K. and Weinstein, S., 1995: “Centering: A Framework for Modeling the Local Coherence of Discourse”, Computational Linguistics, 21/2
• Strube, M. and Hahn, U., 1996: “Functional Centering”
• Walker, M.A.; Joshi, A.K. and Prince, E.F. (eds.), 1997: “Centering in Discourse”
• Kameyama, M., 1997: “Intrasentential Centering: A Case Study”

A crash course on Centering
• Cf(u) = <e1, e2, ..., ek> - an ordered list
• Cb(u) = ei
• Cp(u) = e1

                    Cb(u) = Cb(u-1)    Cb(u) ≠ Cb(u-1)
  Cb(u) = Cp(u)     CONTINUING         SMOOTH SHIFT
  Cb(u) ≠ Cp(u)     RETAINING          ABRUPT SHIFT

• Transitions: CON > RET > SSH > ASH

Transitions: an example
a. Terry really goofs sometimes.
   Cf = ([Terry])
   CONTINUING
b. Yesterday was a beautiful day and
he was excited about trying out his new sailboat.
   Cf = (he=his=[Terry], [the sailboat])
   Cb = [Terry]
   CONTINUING
c. He wanted Tony to join on a sailing expedition.
   Cf = (he=[Terry], [Tony], [the expedition])
   Cb = [Terry]
   CONTINUING
d. He called Tony at 6 A.M.
   Cf = (he=[Terry], [Tony])
   Cb = [Terry]
   SMOOTH SHIFT
e. Tony was sick and furious at being woken up so early.
   Cf = ([Tony])
   Cb = [Tony]

Veins over discourse structure (Veins Theory – VT)
(Cristea, Ide and Romary, 1998), (Ide and Cristea, 2000)
• Offers a better understanding of the relationship between discourse structure and referentiality
• Independent of segment granularity
• Generalises Centering from local to global
• Assists anaphora resolution processes
• Guides discourse parsing
• One application: summarization and, especially, focused summaries

VT
[Figure: a tree as in RST (relations, nuclearity, labelled units), each node annotated with its head (H) and vein (V) expressions, with the DRA of selected units]
The reason why “she” can refer to Mary but not to John's mother
1. John told Mary that he loves her.
2. He has never been married.
3. As such, he lived until his 40s with his mother.
4. She, on the contrary, was married twice.
[Figure: discourse tree over units 1-4 (relations: elaboration, antithesis, elaboration); vein expression shown: V = 1 2 4]

Changes in the significance of pronouns dictated by the structure
1. With one year before finishing his mandate as president of the company,
2. Mr. W. Ross has begun to bring about its bankruptcy.
3. There were rumors that he has obtained it by fraud.
[Figure: discourse tree over units 1-3 (relations: circumstance, background); vein expression shown: V = 2 3]

Changes in the significance of pronouns dictated by the structure
1. Mr. W. Ross has begun to bring about the bankruptcy of his company
2. with one year before finishing his mandate as president.
3. There were rumors that he has obtained it by fraud.
[Figure: discourse tree over units 1-3 (relations: circumstance, background); vein expression shown: V = 1 2 3]

Computing veins
Two steps:
  – Compute heads of each node
  – Compute veins of each node

Heads
• Head expression of a node: the sequence of the most important units within the corresponding span of text:
  – the head of a terminal node: its label
  – the head of a non-terminal node: the concatenation of the head expressions of the nuclear children
  – the important units are projected up to the level where the corresponding span is seen as a satellite

Computing heads
[Figure: example tree over units 1-5 with the head expression (H) of each node]

Veins
• Vein expression of a node: the sequence of units that are significant to understand the span of text covered by the node, in the context of the whole discourse
  – to understand a piece of text in the context of the whole discourse, one needs the significant units within the span together
with other surrounding units

Computing veins
• Vein expressions are computed top-down starting with the root
  – the vein expression of the root is its head expression
  – 4 other rules are applied

Heads and veins
[Figure: example tree over units 1-5 with the head (H) and vein (V) expression of each node]

Domain of referential accessibility (DRA)
• The vein expression of a terminal node (discourse unit): the sequence of units that are significant to understand just that unit, in the context of the whole discourse
• Referents of anaphors are to be found only to their left => DRA(u) = pref(u, vein(u)) (simplified)

Types of references
[Figure: evocative references along a coreference chain, inside the vein expression]
- evocative resolution processes:
  - an anaphor may be resolved to a referent that is not linearly the closest, but only hierarchically the closest
  - based on associations (pattern matching on morpho-semantic features)
  - fast
  - give fluency to the text

Types of references
[Figure: post-evocative references along a coreference chain, outside the vein expression]
- post-evocative resolution processes:
  - are inferential processes developed in memory,
  - computationally and cognitively slow (they impose a higher inference load),
  - require more powerful referencing means (like proper nouns),
  - are less frequent

The first conjecture (on cohesion)
Antecedents of anaphora are found mainly in the DEA given by the vein expression of the unit the anaphor belongs to. Only if they are not found there must the rest of the units be searched (Extended-DEA).

  Unit    Vein       DEA      E-DEA
  4       1 3 4 5    1 3 4    2 1 3 4

Evaluating the coherence of a discourse
• A smoothness score:
  – CONTINUING = 4
  – RETAINING = 3
  – SMOOTH SHIFT = 2
  – ABRUPT SHIFT = 1
  – NO Cb = 0
• A global smoothness score: summing up the score of all units
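The transition table from the Centering crash course and the smoothness scores listed here can be combined into a few lines of code. The following is only a sketch under stated assumptions: the function names (`transition`, `global_smoothness`) and the data layout are hypothetical, a unit is given as a pair (Cb, Cf) with Cp taken as the first element of Cf, and an undefined previous Cb is treated as "same Cb" (a common convention, not something the slides state). The units may be fed in linear order (Centering) or in the order given by a vein (VT); the code does not decide that.

```python
from typing import List, Optional, Tuple

# Scores as listed on the "smoothness score" slide.
SCORES = {
    "CONTINUING": 4,
    "RETAINING": 3,
    "SMOOTH SHIFT": 2,
    "ABRUPT SHIFT": 1,
    "NO CB": 0,
}

# A unit is (Cb, Cf); Cp is assumed to be the first element of Cf.
Unit = Tuple[Optional[str], List[str]]


def transition(prev_cb: Optional[str], cb: Optional[str], cp: Optional[str]) -> str:
    """Classify the transition into a unit, following the 2x2 Centering table."""
    if cb is None:
        return "NO CB"
    # Assumption: an undefined previous Cb counts as "Cb unchanged".
    same_cb = prev_cb is None or cb == prev_cb
    if cb == cp:
        return "CONTINUING" if same_cb else "SMOOTH SHIFT"
    return "RETAINING" if same_cb else "ABRUPT SHIFT"


def global_smoothness(units: List[Unit]) -> Tuple[int, List[str]]:
    """Sum the per-unit scores and return them together with the labels."""
    total = 0
    labels = []
    prev_cb = None
    for cb, cf in units:
        cp = cf[0] if cf else None
        label = transition(prev_cb, cb, cp)
        labels.append(label)
        total += SCORES[label]
        prev_cb = cb
    return total, labels


# The Terry/Tony example, units a to e.
terry = [
    (None, ["Terry"]),
    ("Terry", ["Terry", "sailboat"]),
    ("Terry", ["Terry", "Tony", "expedition"]),
    ("Terry", ["Terry", "Tony"]),
    ("Tony", ["Tony"]),
]
score, labels = global_smoothness(terry)
print(labels)
print(score)
```

With these conventions the first unit scores 0 (no Cb), the next three are CONTINUING and the last is a SMOOTH SHIFT, which agrees with the labels shown in the example.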
The second conjecture (on coherence)
• The global smoothness score of a discourse when computed following VT is at least as high as the score computed following CT
• But segments, as considered by Centering, typically are developed along veins
• When passing segment frontiers, in a linear reading, transitions are usually abrupt
• Therefore, what we claim here is that long-distance transitions, as computed along veins, are systematically smoother than accidental transitions at segment boundaries

Potential to establish correct coreference links
• Compare Linear-k and Discourse-VT-k models:
  – For each k, each re, and each model M (Linear or VT):
    • p(M-k, re, DRA_k) = 1, if re can be resolved to antecedents in DRA_k; 0, otherwise
    • p(M-k, Corpus) = Σ_{re ∈ Corpus} p(M-k, re, DRA_k)

Potential: higher
[Figure: potential of the VT-k and Linear-k models, in percent (axis from 70% to 95%), plotted against E-DEA size (0 to 9)]

Effort required to find antecedents
• Compare Linear-k and Discourse-VT-k models:
  – For each k, each re, and each model M (Linear or VT) […]

[…] 20 annotation types
  – All annotations manually generated or automatically generated and hand-validated
  – Written and spoken data from 19 genres
  – Collaborative development involving contributions of annotations from the community
  – Freely available from http://www.anc.org

Discourse annotation: how difficult?
• Expensive
  – experts in discourse structure theories (RST) are rara avis
  – takes time, tedious
• Subjective
  – give the same text to 2 expert annotators => 2 structures
  – often, tree structures are very bushy => difficult to handle

Discourse annotation: what to do?
=> let people take only easy decisions:
• Clauses and discourse cue-words (expressions)
  – automatically marked (using tools from http://nlptools.info.uaic.ro/)
• Hand validated by three annotators
• Nucleus-satellite relations given on markers
  – only the nuclearity
  – no need to name them

The Guidelines: clauses (elementary discourse units)
• a sequence of words belonging to a sentence and including, at minimum, a predicate and an explicit or implied subject, and expressing a proposition, statement, idea, etc.
• a clause is headed by a main verb or a verbal compound, considered the “pivot”; clause boundaries occur between pivots
  – verb compounds are sequences of more than one verb, including one main verb plus auxiliaries and infinitives

The Guidelines: markers
• Discourse markers are words or groups of words that signal clause boundaries and/or rhetorical relations between two text spans
  – not every pair of clauses is separated by a marker

Markers: nuclearity
• Nuclei: more salient to the discourse structure
• Satellites: provide supporting information
  – nuclei are more essential to the writer's purpose than satellites
  – satellites are often incomprehensible without the nucleus, whereas a text where the satellites have been deleted can be understood to a certain extent
• NUC (required): showing the positioning of the two arguments with respect to the marker: N N, N S, S N, NN, NS, SN, NN , NS , SN

Example (displayed in GATE)
Inline XML rendering
I recently returned to the US from Australia. The 14-hour flight took me from Monday morning in Sydney to Monday morning, again, in L.A. Crossing the date line messed up my sense of time enough without the added bonus of thinking I should be heading to bed just as the sun began to climb into the California sky.

What can cue-phrases tell about structure?
(Marcu, 1997, 2000; Cristea et al., 2003, 2005)
[Because John is such a generous man 1] [– whenever he is asked for money, 2] [he will give whatever he has, for example 3] [– he deserves the “Citizen of the year” award 4]
From (Cristea and Webber, 1998)
• Marker “because”: [Figure: candidate argument spans over units 1-4 for the patterns “because , ” and “because -, -”; nuclearity S N]
• Marker “whenever”: [Figure: candidate argument spans over units 2-4 for the patterns “whenever , ” and “whenever -, -”; nuclearity S N]
• Marker “for example”: [Figure: candidate argument spans over units 1-3 for the patterns “, for example” and “-, - for example”; nuclearity N S]

Two trees resulted after considering all constraints:
[Figure: two candidate trees over units 1-4, built from the patterns “because -, -”, “whenever -, -” and “-, - for example”]

Possible structures, as implied by markers
[Figure: the same two candidate trees, one labelled “right” and the other “wrong”]

How can references help in discovering the structure?
[Because John is such a generous man 1] [– whenever he is asked for money, 2] [he will give whatever he has, for example 3] [– he deserves the “Citizen of the year” award 4]
[Figure: the two candidate trees, with the DEA of each terminal unit]

Categories of MASC discourses
1. Class E (erroneous) – when the set of constraints imposed by markers is contradictory, yielding NO valid tree structure. Due to errors in the annotation of discourse markers.
2. Class D (determined) – showing just one structure.
3. Class U (underdetermined) – showing more than one discourse structure, but all having equal DRAs on terminal nodes.
4. Class R (referentiality-based) – showing more than one discourse structure, with unequal DRAs on terminal nodes. In principle, these should be structures which can be decided (further refined) based on referentiality constraints and heuristics.

What shall come out of this?
• Some statistics:
  – the percentage of class E, D, U and R structures computed on the whole corpus;
  – the degree of underdeterminism: how many trees do the class U structures contain? (if many, then our constraint system and the hypotheses feeding it make a lot of refining work unnecessary);
  – the degree of laxity of class R structures: how many trees do the class R structures contain?
These are the texts/sentences in MASC on which the structures cannot be decided by applying combinatorics

Measures to compare similarity of discourse trees
• The Vein Score: compares discourse trees on the ground of VT scores:
  – ILi – number of identical labels in the vein expression of the edu i in Test tree and Gold tree;
  – Ti – total number of labels of the vein expression for edu i in Test tree;
  – Gi – total number of labels of the vein expression for edu i in Gold tree;
  – N – total number of edus

Measures to compare similarity of discourse trees
• The DRA Score: compares discourse trees on the ground of VT-DRA scores:
  – ILi – number of identical labels in the DRA expression of the edu i in Test tree and Gold tree;
  – Ti – total number of labels of the DRA expression for edu i in Test tree;
  – Gi – total number of labels of the DRA expression for edu i in Gold tree;
  – N – total number of edus

Could MASC give evidence for an underdetermined representation of the discourse structure?
And what to do with the rest?

Combining discourse and coreferentiality annotation
• discourse markers linking clauses and sentences
• discourse entities (markables): named entities (Organization, Location, Person), bare nouns (noun chunks), pronouns (including possessive forms)
• coreference chains linking markables

Parsing based on applying the Right Frontier Constraint and heuristics that exploit VT's claim on coherence
• As an attachment constraint in an incremental development of a tree structure:
  – adjunction of an auxiliary tree on a node of a developing tree can be made only on a node of the right frontier of the developing tree
  – which one?
The one that best matches the coreferential needs of the new material nodes

Incremental discourse parsing - a TAG inspired approach
Adjoining to the right frontier (Polanyi, 1988)
[Figure: an auxiliary tree with foot node * adjoined at a node a on the right frontier of tree τ, yielding tree τ’]

Substitution in case of free expectations
k. Although Bill would have wanted it,
k+1. John sold his bicycle to somebody else
[Figure: tree τ with an open expectation after unit k; unit k+1 is substituted there, yielding tree τ’]

Expectations – examples
• On the one hand John is very generous. On the other he is extremely difficult to find.
• On the one hand John is very generous. On the other, suppose you needed some money. You'll see he is very difficult to find.
From Cristea&Webber, 1999

Expectations-driven incremental parsing
(Cristea and Webber, 1998)
a. Clinton is bound to win the elections.
b. He is a natural born campaigner.
c. If you hold some position on an issue,
d. then if Clinton wants to get your vote,
e. he will assure you with great sincerity that he holds that position too.
[Figure: final tree EVIDENCE(a, EVIDENCE(b, ANT-CONS(c, ANT-CONS(d, e))))]

Expectations-driven incremental parsing, step by step
• Units a, b: [Figure: auxiliary tree EVIDENCE with foot node * and new unit b, to be adjoined at a]
• Units a, b, c: [Figure: tree EVIDENCE(a, b) and an auxiliary tree EVIDENCE with foot node * and new unit c]
• Units a, b, c: [Figure: tree EVIDENCE(a, EVIDENCE(b, c))]
• Units a, b, c: [Figure: tree EVIDENCE(a, EVIDENCE(b, ANT-CONS(c, ?))), where ? is an open expectation]
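The adjunction steps traced in this example all respect the Right Frontier Constraint stated earlier: new material may attach only to a node on the path from the root to the rightmost leaf. A minimal sketch of that constraint follows. It is not the parser described in the talk: the `Node` class and the functions `right_frontier` and `adjoin` are hypothetical names, trees are assumed to be binary, and an auxiliary tree is simplified to "relation with a foot node on the left and the new unit on the right". Expectations and the scoring heuristics are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    label: str                        # relation name or unit label
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def right_frontier(root: Node) -> List[Node]:
    """Nodes on the path from the root to the rightmost leaf, top-down."""
    frontier = []
    node: Optional[Node] = root
    while node is not None:
        frontier.append(node)
        node = node.right
    return frontier


def adjoin(root: Node, target: Node, relation: str, new_unit: Node) -> Node:
    """Replace `target` by relation(target, new_unit); return the new root.

    Raises ValueError if `target` is not on the right frontier.
    """
    if not any(n is target for n in right_frontier(root)):
        raise ValueError("target is not on the right frontier")
    wrapped = Node(relation, left=target, right=new_unit)
    if target is root:
        return wrapped
    # The target is on the right spine, so walking right children finds its parent.
    parent = root
    while parent.right is not target:
        parent = parent.right
    parent.right = wrapped
    return root


# Units a, b, c of the Clinton example.
a, b, c = Node("a"), Node("b"), Node("c")
tree = adjoin(a, a, "EVIDENCE", b)      # EVIDENCE(a, b)
tree = adjoin(tree, b, "EVIDENCE", c)   # EVIDENCE(a, EVIDENCE(b, c))
labels = [n.label for n in right_frontier(tree)]
print(labels)
```

After the two adjunctions, unit a is no longer on the right frontier, so a further call such as `adjoin(tree, a, "EVIDENCE", Node("d"))` raises `ValueError`. A real parser would instead score every frontier node (for instance with the heuristics H1 to H5) and pick the best one.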
Expectations-driven incremental parsing, further steps
• Units a, b, c (c: If you hold some position on an issue,): [Figure: tree EVIDENCE(a, b) and an auxiliary tree EVIDENCE with foot node * and subtree ANT-CONS(c, ?)]
• Units a, b, c, d (d: he will assure you with great sincerity that he holds that position too): [Figure: tree EVIDENCE(a, EVIDENCE(b, ANT-CONS(c, ?))), before d is attached]
• Units a, b, c, d (same d): [Figure: tree EVIDENCE(a, EVIDENCE(b, ANT-CONS(c, d)))]
• Units a, b, c, d (d: then if Clinton wants to get your vote,): [Figure: tree EVIDENCE(a, EVIDENCE(b, ANT-CONS(c, ?))) and the elementary tree ANT-CONS(d, ?)]
• Units a, b, c, d, e (e: he will assure you with great sincerity that he holds that position too): [Figure: tree EVIDENCE(a, EVIDENCE(b, ANT-CONS(c, ANT-CONS(d, ?))))]
Expectations-driven incremental parsing, final step
• Units a, b, c, d, e: [Figure: tree EVIDENCE(a, EVIDENCE(b, ANT-CONS(c, ANT-CONS(d, e))))]

Strategies-based parsing
• H1 – Favouring lower adjunction levels: give better scores to adjunctions at the lower part of the RF

Strategies-based parsing
• H2 – Opening the minimum referentiality domain: if the material node m contains a reference that can be satisfied by the domains D1, …, Dk, give better scores to the domains having fewer referents. This heuristic favours upper adjunction levels.

Strategies-based parsing
• H3 – Maximum referentiality score: favour adjunction positions where most referents find antecedents in the referentiality domains given by veins
  • zero pronoun: compulsory to have an antecedent on the vein
  • clitic: antecedent extremely wanted on the vein
  • overt pronoun: antecedent very much wanted on the vein
  • common noun: good to have an antecedent on the vein
  • proper noun: not necessary that the antecedent be on the vein

Strategies-based parsing
• H4 – Maximum CT transitions: favour adjunction positions that give an overall smoother understanding of the discourse (maximise the CT transitions on veins)
  Scores of transitions:
  • continuing = 4
  • retaining = 3
  • smooth shift = 2
  • abrupt shift = 1
  • no Cb = 0

Strategies-based parsing
• H5 – Substitute first, if no other needs: whenever possible, solve substitutions first

Annotation associated with markers
• Sufficient to support the heuristics:
  – Distinction between types of markers:
    • announcing expectations
    • relating to old discourse

Acknowledgements for Part II
• Nancy Ide and her students
at Vassar: for MASC
• Oana Postolache, Daniel Anechitei, Elena Mitocariu: for anaphora resolution, discourse parsing algorithms and measures of similarity
• The FP7 ATLAS project: applying the discourse parsing algorithms on 6 languages: Bulgarian, English, German, Greek, Polish, Romanian

Part III
The QuoVadis experiment
• The corpus and motivation
• Conventions of annotation
• How to build a large corpus with low human resources
• What do we expect from it?

QuoVadis: deciphering relations between book characters
• Twice: term research projects (Feb – Jun 2013 and Oct 2013 – Jan 2014) with master students in CL => building of a large corpus annotated for discourse entities and semantic relations

‘QuoVadis’ corpus

Corpus of entities and semantic relations
• Corpus encoding mentions of:
  – persons;
  – gods;
  – groups of persons and gods;
  – body parts of persons and gods;
  – semantic relations linking these entity types

Entities
• individuals (Marcus Vinicius, Lygia), groups (the Christians, the soldiers) and classes (the emperor);
• syntactic realisation: NPs (determiners – a soldier, adjectives – young patrician, complement PPs included – the son of one consul; but no relative clauses);
• included entities for the Romanian language: [te]1 [iubesc; REALISATION=INCLUDED]2, Marcus vs. [I]2 love [thee]1, Marcus
• recursive referential expressions: [the adherents [of Christ]2]1 are praying…

Identification of the arguments
• recursive:
  – the anaphor/source is larger than the antecedent/destination
• non-recursive:
  – referential:
    • the anaphor is to the right of the antecedent
  – non-referential:
    • from source to destination reading the trigger

Relations
• Anaphoric relations: co-referential
• Non-anaphoric relations:
  – kinship
  – affective
  – social

Anaphoric relations
• coref
• coref-interpret
• member-of, has-as-member (inverse)
• isa, class-of (inverse)
• part-of, has-as-part (inverse)
• subgroup-of, has-as-subgroup (inverse)
• has-name, name-of (inverse)
Example: [Lygia]1 was unable to answer, for weeping seized [her]2 anew. Acte gathered [the maiden]3 to her bosom, and strove to calm [her]4 excitement.
coref ; coref-interpret ; coref

Kinship relations
• parent-of
• child-of (inverse of parent-of)
• grandparent-of and grandchild-of (inverse)
• sibling (symmetrical)
• aunt-uncle-of, nephew-of (inverse relation)
• cousin-of (symmetrical)
• spouse-of (symmetrical)
• unknown
Example: “Pardon me, Lygia. For me thou art [… [of a king]2]1 and [… [of Plautius]4]3”
child-of ; child-of

Social relations
• superior-of
• inferior-of
• in-cooperation-with
• colleague-of
• in-competition-with
• opposite-to
Example: [Petronius]1…but to [his]2 misfortune [he]3 […] [Cæsar himself]4, hence [he]5 roused [his]6 jealousy
in-competition-with ; coref ; coref ; coref

Affective relations
• love
• loved-by
• hate
• hated-by
• upset
• friendship
• worship
Example: Vinicius entered Lygia's dungeon and remained there till daylight…Both changed by degrees into sad souls with [each]1 [other]2
rec-love

A complex example
• cui i-ar fi putut trece prin minte că [un patrician]1, [nepot şi [fiu de [consuli]4]3]2, ar putea să se găsească printre gropari
• Besides, into whose head could it enter that [a patrician]1, [the grandson [of one consul]5]2, [the son [of another]7]6, could be found among servants, corpse-bearers
• ro: coref , kinship:grandchild-of ; kinship:child-of ;
• en: coref , coref , kinship:grandchild-of , kinship:child-of

General statistics over the corpus
• 7,281 sentences
• 146,822 tokens, punctuation included
• 171,029 tokens summed up under all relations
• 24,636 entity mentions
• 22,301
referential relations
• 755 AKS relations (Affective + Kinship + Social)
• 752 triggers

[Figure: Example: affective relations love and worship in the corpus]

[Figure: Example: affective relations fear-of and hate in the corpus]

[Figure: Vinicius' links with other characters]

[Figure: The distribution of semantic relations involving the main character Vinicius]

QuoVadis in the LLOD world
- Developing techniques for deciphering the content of texts
  - identifying characters and linking their mentions
  - evidencing internal relations that would allow:
    - intelligent searches in the content (for instance, the first mention of a character),
    - connections between entities (for instance, family trees),
    - visualising statistics about entities, etc.

‘QuoVadis’
• http://nlptools.infoiasi.ro/Resources.jsp

Acknowledgements for Part III
• Anca Bibiri, Cătălina Mărănduc, Paul Diac, Daniela Gîfu, Mihaela Colhon, Andrei Scutelnicu and master students in CL

Part IV – linking books in the virtual and real world: MappingBooks
• Origins of the project
• A mapped book…
• Features of the technology
• The architecture of the system
• Conclusions

MappingBooks
Get out of the book in the virtual and real world!
I like to read books and to travel…

I need help to remember all kinship relations between characters

Characters in Forsyte Saga
• The old Forsytes
  – Ann, the eldest of the family
  – Old Jolyon, the patriarch of the family, having made a fortune in tea
  – James, a solicitor, married to Emily, a most tranquil woman
  – Swithin, James's twin brother with aristocratic pretensions; a bachelor
  – Roger, "the original Forsyte"
  – Julia (Juley), a fluttery dowager; Mrs Septimus Small
  – Hester, an old maid
  – Nicholas, the wealthiest in the family
  – Timothy, the most cautious man in England
  – Susan, the married sister
• The young Forsytes
  – Young Jolyon, Old Jolyon's artistic and free-thinking son, married three times
  – Soames, James and Emily's son, an intense, unimaginative and possessive solicitor, married to the unhappy Irene, who later marries Young Jolyon
  – Winifred, Soames's sister, one of the three daughters of James and Emily, married to the foppish and lethargic Montague Dartie
  – George, Roger's son, a dyed-in-the-wool mocker
  – Francie, George's sister and Roger's daughter, emancipated from God
• Their children
  – June, Young Jolyon's defiant daughter from his first marriage; engaged to an architect, Philip Bosinney, who becomes Irene's lover
  – Jolly, Young Jolyon's son from his second marriage; dies of enteric fever during the Boer Wars
  – Holly, Young Jolyon's daughter from his second marriage, to June's governess
  – Jon, Young Jolyon's son from his third marriage, to Irene, Soames's first wife
  – Fleur, Soames's daughter from his second marriage, to a French Soho shopgirl Annette; Jon's lover; later marries a baronet, Michael Mont
  – Val, Winifred and Montague's son; fights in the Boer Wars; marries his cousin Holly
  – Imogen, Winifred and Montague's daughter
• Others
  – Parfitt, Old Jolyon's butler
  – Smither, Aunts Ann, Juley and Hester's housekeeper
  – Warmson, James and Emily's butler
  – Bilson, Soames's housemaid
  – Prosper Profond, Winifred's admirer and
Annette's lover

Going out of the book…
[Figure: Google Maps walking directions from Katip Çelebi Mh., Maç Sk., Beyoğlu, Turkey to Çukur Cuma Cd., Beyoğlu, Turkey (400 m, about 4 mins)]

Idea
• Right now: each book – so many readers…
• MappingBooks: I buy a book – wow, it is specially designed for me!
Purpose
• connect a reader of a book with events/locations/persons in the real or virtual world, adapted to her/his location and personal portrait
• help publishing houses sell their products better
• supply researchers and companies working on Language Technologies with heavily annotated textual data

A “mapped book”
• a book connected with events/locations/persons in the real and virtual world
• the reader gets selective information, depending on personal portrait (cultural and touristic preferences, taken from social sites…) and instantaneous info (e.g., location, as sensed by the mobile/tablet)

Use cases
- I visit a town with a guide in my hand
  - places of interest are re-ranked depending on my position
- I am a high school boy, traveling by train from Suceava to Bucharest…
  - if I open my tablet and position it by the right window, it indicates the Ceahlău mountain, as in my Geography manual
- I am in Paris for the third time…
  - but only now my MB Lonely Planet guide on Paris tells me about this exhibition in the Pyramid
- I arrived in Cologne for the first time and I open my tablet when I arrive in the Central station…
  - my MB guide points me directly to the Dome

Exploitation of textual information in MappingBooks
Aims
1) connect entities' mentions in the form of nominals (noun phrases) => one coreferential chain corresponds to each entity;
2) no preliminary records about linked entities => the knowledge base evolves from scratch;
3) look especially for coreferential (identity of entity mentions) and geographical relations (position, distance, point-of, near, intersects, etc.);
4) texts under investigation: Geography manuals and traveling guides

Entity linking
• Challenges in entity linking:
  – name variations
  – ambiguities
  – absence
    • entity
    • link type

Entities in MB
• Type PERSON
• Type LOCATION
• Type ORGANISATION
• Type URL
• Type TIMEX
Sibiu, 23-24 July Textual realisaon of enes • Syntacc realisaon: NPs (proper nouns, common nouns, adjecves, complement PPs; but NO relave clauses) • Characterised by disncve heads – [the house on the [mountain]] • If intersected è imbricated – [the museum [Grigore Anpa]] EUROLAN-2015, Sibiu, 23-24 July Features we want to have • The capacity to see a text diﬀerent than a string of leers – sentence spling – tokenisaon – POS-tagging – lemmasaon – NP chunking – anaphora resoluon TEXT ANALYTICS EUROLAN-2015, Sibiu, 23-24 July Features we want to have • Know who’s who – recognise names and types – disambiguate names – recognise an enty in the text even if menoned with a common noun or a pronoun – use an ontology of types NAME ENTITY RECOGNITION EUROLAN-2015, Sibiu, 23-24 July Features we want to have • What virtual world enes are menoned in the book – link textual menons of enes in the virtual world – decide what info from virtual would be relevant to user – use mulple sources ENTITY CROWLING EUROLAN-2015, Sibiu, 23-24 July Features we want to have • Trace on Google Maps a spaal relaon described in the book – spaal relaons detecon in text – use Google Maps APIs or related free technologies – trace locaons and paths on maps RELATIONS DETECTION MAPS&TRAJECTORIES EUROLAN-2015, Sibiu, 23-24 July Features we want to have • Fetch, process and make use of geo-data – Geographic Informaon Systems (GIS) – geographic layers GEOGRAPHY EUROLAN-2015, Sibiu, 23-24 July Features we want to have • Know where I am • What real world enes are in my proximity – detecon of my posion – computaon of distances from the menoned places – signalling “interesng” locaons in proximity DEVICE INFO EUROLAN-2015, Sibiu, 23-24 July Features we want to have • Mix images with generated info – locate the posion of the user (GPS) – sense orientaon of the camera (compas) – process images => segment, contours, recognion – decide info to be displayed AUGMENTED REALITY EUROLAN-2015, Sibiu, 23-24 July Features we 
want to have • Aracve user interfaces – analyse use cases – design dedicated user interfaces – accommodate on the screen a segment of text, a map, user’s posion, web info, etc INTERFACES EUROLAN-2015, Sibiu, 23-24 July Features we want to have • Client-server – user’s Portrait – the databases – standards and communicaon protocols CLIENT-SERVER EUROLAN-2015, Sibiu, 23-24 July Other issues… • RESOURCES – ﬁnd the texts – clear IPR – perform annotaon – ﬁnd other relevant linguisc data EUROLAN-2015, Sibiu, 23-24 July TA = Text Analytics NER = Name Entity Recognition AR = Augmented Reality EC = Entity Crowling DEV = Device Info RD = Relations Detection INT = Interfaces GEO = Geography RES = Resources M&T = Maps and Trajectories M&E = Management and Evaluation EUROLAN-2015, Sibiu, 23-24 July MappingBooks are addressed to… • School children – to put books back in their hands (lost paradise), by boosng interacvity based on readings • Adolescents, adventures, travellers, randonée people (montagnards) – to socialize on common travels • Pensioners – to network based on common readings, cultural preferences • Researchers on LT & Computaonal Linguiscs – to get access to heavily annotated linguisc resources • Providers of textual data (eding houses, media companies, newspapers) – to beer sell their books • Local administraons and tourist agencies – to adverse places described in famous books • If in extensive use, MB could enhance the European common repository of language resources EUROLAN-2015, Sibiu, 23-24 July Towards live books • Mul-dimensional mash-ups combining textual, geographical and temporal data • Spot the book menons (persons and locaons) • Make heavy use of enty linking techniques => connecng enty menons onto the virtual world • Links sensive to: – the context of menons in the book – the current locaon of the reader – the moment the reader iniates an access – The personality of the reader EUROLAN-2015, Sibiu, 23-24 July Acknowledgements • MappingBools is a project 
supported by a grant of the Romanian Ministry of Educaon and Research, July 2014 – June 2016 • Our students in Computer Science, for developing a prototype of the system during their project in AI, in the Autumn – Winter term of 2013-2014… • My colleagues: Daniela Gîfu, Daniel Anechitei, Ionuț Pistol, prof Mihai Niculiță in the Dept of Geography EUROLAN-2015, Sibiu, 23-24 July Thank you! 130 EUROLAN-2015, Sibiu, 23-24 July 
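
Supplementary sketch (not part of the MappingBooks system)
The "Device info" feature listed earlier (computing distances from the reader's position to places mentioned in the book, and signalling those in proximity) could be approximated as below. This is only an illustration under stated assumptions: the function names `haversine_m` and `nearby_places` are hypothetical, the great-circle (haversine) formula is one common choice rather than the project's confirmed method, and the coordinates are rough illustrative values, not data from the project.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearby_places(reader_pos, places, radius_m):
    """Return (name, distance_m) pairs within radius_m, nearest first.

    reader_pos: (lat, lon) of the reader's device.
    places: dict mapping a place name mentioned in the book to (lat, lon).
    """
    lat, lon = reader_pos
    hits = [(name, haversine_m(lat, lon, plat, plon))
            for name, (plat, plon) in places.items()]
    return sorted((h for h in hits if h[1] <= radius_m), key=lambda h: h[1])


if __name__ == "__main__":
    # Approximate, illustrative coordinates (assumptions, not project data).
    reader = (50.9430, 6.9589)  # roughly Cologne Central station
    mentioned = {
        "Cologne Cathedral": (50.9413, 6.9583),
        "Louvre Pyramid": (48.8610, 2.3358),
    }
    for name, dist in nearby_places(reader, mentioned, radius_m=1000):
        print(f"{name}: about {dist:.0f} m away")
```

Sorting by distance is what lets places of interest be re-ranked as the reader moves; a real client would presumably also weigh the reader's personal portrait, which this sketch leaves out.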